Term frequency-inverse document frequency (TF-IDF) is both a mouthful and a process often carried out as part of a text mining approach. The primary idea behind TF-IDF is to find the words which are most important to the content of each document by decreasing the weight of words that are used throughout the collection and increasing the weight of words that are rare in the collection as a whole. Essentially it tries to find words that are frequent within a given text but not common across all the texts. For example, if two texts were identical apart from a single word, a tf-idf approach would isolate that single word as the most important when comparing the two.
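The two-text example can be sketched in a few lines of R. This is a minimal illustration using dplyr and tidytext; the `toy` data frame and its two sentences are invented for the purpose and are not taken from the court roll. Words shared by both texts get an idf of ln(2/2) = 0, so only the differing words end up with a non-zero tf_idf.

```r
library(dplyr)
library(tidytext)

# Two invented "documents" that differ by exactly one word.
toy <- tibble(
  doc  = c("a", "b"),
  text = c("the thief stole the wool", "the thief stole the stones")
)

toy_tf_idf <- toy %>%
  unnest_tokens(word, text) %>%   # one row per word occurrence
  count(doc, word) %>%            # n = occurrences of each word per document
  bind_tf_idf(word, doc, n) %>%   # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))

# Only 'wool' and 'stones' have a tf_idf above zero.
toy_tf_idf
```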
# A tibble: 38,788 × 4
   Lincolnshire_ID word       n total
   <chr>           <chr>  <int> <int>
 1 1016            of       131   657
 2 1016            from      66   657
 3 1016            the       66   657
 4 1016            wool      66   657
 5 1016            vill      61   657
 6 1016            stones    46   657
 7 1130            the       46   490
 8 1142            and       29   462
 9 1044            of        28   245
10 1133            the       27   254
# ℹ 38,778 more rows
This creates a table (lincs_words) with one row for each word-allegation combination. n is the number of times that word is used in that particular allegation and total is the total number of words within that allegation. Above you can see the first ten rows of the resulting dataframe.
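The code that built this table is not shown, so what follows is only a sketch of one way to produce a table of this shape with dplyr and tidytext. The `lincs_toy` data frame, its `text` column and its two invented allegations are assumptions for illustration; only the column names Lincolnshire_ID, word, n and total come from the output above.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the court roll: one row per allegation.
lincs_toy <- tibble(
  Lincolnshire_ID = c("1", "2"),
  text = c("theft of wool from the vill", "theft of stones and of wool")
)

# n: how often each word occurs in each allegation.
word_counts <- lincs_toy %>%
  unnest_tokens(word, text) %>%
  count(Lincolnshire_ID, word, sort = TRUE)

# total: the number of words in each allegation.
allegation_totals <- word_counts %>%
  group_by(Lincolnshire_ID) %>%
  summarise(total = sum(n))

# One row per word-allegation combination, with n and total side by side.
lincs_words_toy <- left_join(word_counts, allegation_totals,
                             by = "Lincolnshire_ID")
lincs_words_toy
```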
# A tibble: 38,788 × 6
   Lincolnshire_ID word       n total  rank `term frequency`
   <chr>           <chr>  <int> <int> <int>            <dbl>
 1 1016            of       131   657     1           0.199
 2 1016            from      66   657     2           0.100
 3 1016            the       66   657     3           0.100
 4 1016            wool      66   657     4           0.100
 5 1016            vill      61   657     5           0.0928
 6 1016            stones    46   657     6           0.0700
 7 1130            the       46   490     1           0.0939
 8 1142            and       29   462     1           0.0628
 9 1044            of        28   245     1           0.114
10 1133            the       27   254     1           0.106
# ℹ 38,778 more rows
Now I can calculate the frequency of each term within each allegation. This is carried out by dividing n, the number of times a particular word appears in an allegation, by total, the number of words in that allegation. For 'of' in allegation 1016, for example, this is 131 / 657 ≈ 0.199. The rank column numbers the words within each allegation from most to least frequent. Again, above are the first ten rows of the dataframe which results from these calculations.
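The rank and term frequency columns can be derived along these lines. This is a sketch under assumptions: `words_toy` holds invented counts in the same shape as the table above, and the explicit sort stands in for whatever ordering the original code relied on.

```r
library(dplyr)

# Invented counts in the same shape as the word-allegation table.
words_toy <- tibble(
  Lincolnshire_ID = c("1", "1", "1", "2", "2"),
  word  = c("of", "wool", "vill", "the", "sheep"),
  n     = c(4L, 3L, 1L, 3L, 2L),
  total = c(8L, 8L, 8L, 5L, 5L)
)

freq_by_rank_toy <- words_toy %>%
  arrange(Lincolnshire_ID, desc(n)) %>%     # most frequent word first
  group_by(Lincolnshire_ID) %>%
  mutate(rank = row_number(),               # position within the allegation
         `term frequency` = n / total) %>%  # share of the allegation's words
  ungroup()

freq_by_rank_toy
```

Because total is the word count of the allegation, the term frequencies within any one allegation sum to 1.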
# A tibble: 38,788 × 7
   Lincolnshire_ID word       n total     tf    idf  tf_idf
   <chr>           <chr>  <int> <int>  <dbl>  <dbl>   <dbl>
 1 1016            of       131   657 0.199  0.0572 0.0114
 2 1016            from      66   657 0.100  0.486  0.0488
 3 1016            the       66   657 0.100  0.285  0.0286
 4 1016            wool      66   657 0.100  1.78   0.179
 5 1016            vill      61   657 0.0928 1.39   0.129
 6 1016            stones    46   657 0.0700 2.41   0.168
 7 1130            the       46   490 0.0939 0.285  0.0267
 8 1142            and       29   462 0.0628 0.332  0.0208
 9 1044            of        28   245 0.114  0.0572 0.00654
10 1133            the       27   254 0.106  0.285  0.0303
# ℹ 38,778 more rows
The tf_idf will be close to zero for the most common words, that is, the words which appear in most allegations. Among the first ten rows 'of', 'and' and 'the' have the lowest scores. The highest values belong to words which are frequent within an allegation but occur in few allegations overall, such as 'wool' and 'stones' here.
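To make the relationship between the three columns concrete, the sketch below computes tf-idf twice on invented counts: once with tidytext's `bind_tf_idf()` and once by hand, where idf is the natural log of the number of allegations divided by the number of allegations containing the word. The `counts_toy` data are an assumption for illustration, not values from the court roll.

```r
library(dplyr)
library(tidytext)

# Invented word counts for three allegations.
counts_toy <- tibble(
  Lincolnshire_ID = c("1", "1", "2", "2", "3", "3"),
  word = c("of", "wool", "of", "stones", "of", "wool"),
  n    = c(3L, 1L, 2L, 2L, 1L, 1L)
)

# tidytext adds tf, idf and tf_idf in one step.
tf_idf_toy <- counts_toy %>%
  bind_tf_idf(word, Lincolnshire_ID, n)

# The same quantities computed by hand.
n_docs <- n_distinct(counts_toy$Lincolnshire_ID)
by_hand <- counts_toy %>%
  group_by(Lincolnshire_ID) %>%
  mutate(tf = n / sum(n)) %>%                               # share within the allegation
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(Lincolnshire_ID)),   # rarity across allegations
         tf_idf = tf * idf) %>%
  ungroup()

# 'of' appears in every allegation, so its idf and tf_idf are zero.
tf_idf_toy
```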
It is unsurprising that almost all of the words identified as the most important are place or personal names. Delving further shows that some occupations are also identified as important, with 'fletcher' occurring twice, as are the foodstuffs 'pork' and 'poultry'. It may be that the text within each allegation is often too short to characterise adequately using tf-idf. This text is based on a calendared edition. In other words, it omits much of the repetitive legalese which predominates in fourteenth-century court documents (perhaps all court documents). This aids readability but might hinder analyses such as this one.
# Removes entries containing personal names, place names, digits, and common English words.
stop_tf_idf <- lincs_tf_idf %>%
  anti_join(lincs_stop_people)
stop_tf_idf <- stop_tf_idf %>%
  anti_join(lincs_stop_places)
stop_tf_idf <- stop_tf_idf %>%
  filter(grepl('^\\D', word))
stop_tf_idf <- stop_tf_idf %>%
  anti_join(stop_words)
stop_tf_idf %>%
  select(-total) %>%
  arrange(desc(tf_idf))
Removing those words which refer to locations, individuals, or digits (primarily the index created by Bernard McLane, the original editor) gives a word list of much more interest. This contains words which indicate a legal process, such as 'maintainer' (someone who facilitates lawsuits for third parties to harass others) and 'forestaller' (an individual who buys goods in anticipation of rising prices so that they might resell them at a profit). It also includes words which are indicative of the material involved in the allegations, 'sheep' and 'chickens', and professions, 'archer', 'bookbinder' and 'oilmaker'. A list like this could be the beginning of a tailored classification which sought to annotate professions, evidence of the possessions of victims of crime, or the prevalence of particular types of judicial processes. I suspect that tf-idf might be even more useful when comparing larger bodies of text, such as the records of different courts, rather than for internal comparisons of the sort I have carried out here.